{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Spelling Correction using JamSpell"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "\n",
    "This tutorial is available as an IPython notebook at [Malaya/example/spelling-correction-jamspell](https://github.com/huseinzol05/Malaya/tree/master/example/spelling-correction-jamspell).\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import logging\n",
    "\n",
    "logging.basicConfig(level=logging.INFO)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:torch.distributed.nn.jit.instantiator:Created a temporary directory at /tmp/tmp0bb_h21m\n",
      "INFO:torch.distributed.nn.jit.instantiator:Writing /tmp/tmp0bb_h21m/_remote_module_non_scriptable.py\n",
      "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397\n",
      "  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n",
      "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927\n",
      "  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n"
     ]
    }
   ],
   "source": [
    "import malaya"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# some text examples copied from Twitter\n",
    "\n",
    "string1 = 'krajaan patut bagi pencen awal skt kpd warga emas supaya emosi'\n",
    "string2 = 'Husein ska mkn aym dkat kampng Jawa'\n",
    "string3 = 'Melayu malas ni narration dia sama je macam men are trash. True to some, false to some.'\n",
    "string4 = 'Tapi tak pikir ke bahaya perpetuate myths camtu. Nanti kalau ada hiring discrimination despite your good qualifications because of your race tau pulak marah. Your kids will be victims of that too.'\n",
    "string5 = 'DrM cerita Melayu malas semenjak saya kat University (early 1980s) and now as i am edging towards retirement in 4-5 years time after a career of being an Engineer, Project Manager, General Manager'\n",
    "string6 = 'blh bntg dlm kls nlp sy, nnti intch'\n",
    "string7 = 'mulakn slh org boleh ,bila geng tuh kena slhkn jgk xboleh trima .. pelik'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### List available models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'wiki+news': {'Size (MB)': 337},\n",
       " 'wiki': {'Size (MB)': 148},\n",
       " 'news': {'Size (MB)': 215}}"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "malaya.spelling_correction.jamspell.available_model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load JamSpell speller\n",
    "\n",
    "JamSpell use Norvig + Ngram words.\n",
    "\n",
    "Before you able to use this spelling correction, you need to install [jamspell](https://github.com/bakwc/JamSpell),\n",
    "\n",
    "For mac,\n",
    "\n",
    "```bash\n",
    "wget http://prdownloads.sourceforge.net/swig/swig-3.0.12.tar.gz\n",
    "tar -zxf swig-3.0.12.tar.gz\n",
    "./swig-3.0.12/configure && make && make install\n",
    "pip3 install jamspell\n",
    "```\n",
    "\n",
    "For debian / ubuntu,\n",
    "\n",
    "```bash\n",
    "apt install swig3.0 -y\n",
    "pip3 install jamspell\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "def jamspell(model: str = 'wiki+news', **kwargs):\n",
    "    \"\"\"\n",
    "    Load a jamspell Spell Corrector for Malay.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    model: str, optional (default='wiki+news')\n",
    "        Supported models. Allowed values:\n",
    "\n",
    "        * ``'wiki+news'`` - Wikipedia + News, 337MB.\n",
    "        * ``'wiki'`` - Wikipedia, 148MB.\n",
    "        * ``'news'`` - Wikipedia, 215MB.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    result: malaya.spell.JamSpell class\n",
    "    \"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = malaya.spelling_correction.jamspell.load(model = 'wiki')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### List possible generated pool of words\n",
    "\n",
    "```python\n",
    "def edit_candidates(self, word: str, string: List[str], index: int = -1):\n",
    "    \"\"\"\n",
    "    Generate candidates given a word.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    word: str\n",
    "    string: str\n",
    "        Entire string, `word` must a word inside `string`.\n",
    "    index: int, optional(default=-1)\n",
    "        index of word in the string, if -1, will try to use `string.index(word)`.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    result: List[str]\n",
    "    \"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('suka',\n",
       " 'suke',\n",
       " 'suku',\n",
       " 'duke',\n",
       " 'luke',\n",
       " 'suk',\n",
       " 'sue',\n",
       " 'suki',\n",
       " 'sake',\n",
       " 'ske',\n",
       " 'soke',\n",
       " 'sure',\n",
       " 'yuke',\n",
       " 'suko',\n",
       " 'puke')"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.edit_candidates('suke', 'saya suke makan ayem'.split())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### To correct a word\n",
    "\n",
    "```python\n",
    "def correct(self, word: str, string: List[str], index: int = -1):\n",
    "    \"\"\"\n",
    "    Correct a word within a text, returning the corrected word.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    word: str\n",
    "    string: List[str]\n",
    "        Tokenized string, `word` must a word inside `string`.\n",
    "    index: int, optional(default=-1)\n",
    "        index of word in the string, if -1, will try to use `string.index(word)`.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    result: str\n",
    "    \"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'suka'"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.correct('suke', 'saya suke makan ayem'.split())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'kpd'"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "splitted = string1.split()\n",
    "model.correct('kpd', splitted)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'krajaan'"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.correct('krajaan', splitted)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### To correct a sentence\n",
    "\n",
    "```python\n",
    "def correct_text(self, text: str):\n",
    "    \"\"\"\n",
    "    Correct all the words within a text, returning the corrected text.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    text: str\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    result: str\n",
    "    \"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'krajaan patut bagi pencen awal set kpd warga emas supaya emosi'"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.correct_text(string1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "tokenizer = malaya.tokenizer.Tokenizer()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Husein ska mkn aym dkat kampng Jawa'"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "string2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Husein ska mkn aym dkat kampng Jawa'"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenized = tokenizer.tokenize(string2)\n",
    "model.correct_text(' '.join(tokenized))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Melayu malas ni narration dia sama je macam men are trash . True to some , false to some .'"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenized = tokenizer.tokenize(string3)\n",
    "model.correct_text(' '.join(tokenized))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'DrM cerita Melayu malas semenjak saya kat University ( early 1980s ) and now as i am edging towards retirement in 4 - 5 years time after a career of being an Engineer , Project Manager , General Manager'"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenized = tokenizer.tokenize(string5)\n",
    "model.correct_text(' '.join(tokenized))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'blh yang dlm kes alp sy , nnti inch'"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenized = tokenizer.tokenize(string6)\n",
    "model.correct_text(' '.join(tokenized))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'mulan slh org boleh , bila geng tuh kena salah jgk boleh trima . . pelik'"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokenized = tokenizer.tokenize(string7)\n",
    "model.correct_text(' '.join(tokenized))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'mulakn slh org boleh ,bila geng tuh kena slhkn jgk xboleh trima .. pelik'"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "string7"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
